Task 2 - Data Preprocessing & Analytics¶

Data Preprocessing¶

My student ID is 22202363.

In this assignment I collect business data and user reviews for restaurants & bars in Dublin from the Yelp Fusion API (https://www.yelp.com/developers/documentation/v3/get_started).

In [1]:
import pandas as pd
import numpy as np

#!pip install geopy
from geopy.geocoders import Nominatim


# Read the data file into a pandas data frame.

datapath = "raw.csv"
raw = pd.read_csv(datapath, index_col=0)
raw.head()
Out[1]:
id name category rating reviews price zipcode latitude longitude city address1 address2 address3 display_address
0 fR-pJ6nUn1bjPuT6lS2bsQ The Brazen Head pubs 4.0 739 €€ 8 53.344970 -6.276330 Dublin 20 Bridge Street Lower NaN NaN ['20 Bridge Street Lower', 'Dublin 8', 'Republ...
1 A-HzqcGJVTwHVFTVH_LlPA The Temple Bar pubs 4.0 550 €€ 2 53.345500 -6.264190 Dublin 47/48 Temple Bar Temple Bar NaN ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',...
2 rKvPQZcgjrQOLRU0phPoAQ Queen of Tarts desserts 4.5 511 €€ 2 53.344121 -6.267529 Dublin Cork Hill Dame Street NaN ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu...
3 _449xLONUU9nAUzCja2bNA The Porterhouse Temple Bar pubs 4.0 369 €€ 2 53.345100 -6.267550 Dublin 16-18 Parliament Street NaN NaN ['16-18 Parliament Street', 'Dublin 2', 'Repub...
4 -VIve-QeHR9-cKr7QldqtA Elephant & Castle tradamerican 4.0 345 €€ 2 53.345600 -6.262470 Dublin 18 Temple Bar NaN NaN ['18 Temple Bar', 'Dublin 2', 'Republic of Ire...
In [2]:
raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1734 entries, 0 to 1733
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1734 non-null   object 
 1   name             1734 non-null   object 
 2   category         1734 non-null   object 
 3   rating           1734 non-null   float64
 4   reviews          1734 non-null   int64  
 5   price            1425 non-null   object 
 6   zipcode          1604 non-null   object 
 7   latitude         1732 non-null   float64
 8   longitude        1732 non-null   float64
 9   city             1734 non-null   object 
 10  address1         1723 non-null   object 
 11  address2         487 non-null    object 
 12  address3         23 non-null     object 
 13  display_address  1734 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 203.2+ KB

1. Delete duplicate rows¶

In [3]:
data1 = raw.copy()   # work on a copy so the raw frame stays untouched
data1.drop_duplicates(inplace=True)
data1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1539 entries, 0 to 1733
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1539 non-null   object 
 1   name             1539 non-null   object 
 2   category         1539 non-null   object 
 3   rating           1539 non-null   float64
 4   reviews          1539 non-null   int64  
 5   price            1248 non-null   object 
 6   zipcode          1417 non-null   object 
 7   latitude         1537 non-null   float64
 8   longitude        1537 non-null   float64
 9   city             1539 non-null   object 
 10  address1         1528 non-null   object 
 11  address2         428 non-null    object 
 12  address3         21 non-null     object 
 13  display_address  1539 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 180.4+ KB
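A note on the deduplication above: drop_duplicates() called with no arguments only removes rows that repeat in all 14 columns. Since id is Yelp's unique business key, deduplicating on it alone would also catch rows where the same business appears twice with slightly different field values. A minimal sketch on toy rows (the frame and values below are illustrative, not the real data):

```python
import pandas as pd

# toy frame: business 'a1' was fetched twice with different snapshot values
toy = pd.DataFrame({
    'id':     ['a1', 'a1', 'b2'],
    'rating': [4.0, 4.5, 3.5],
})

# keep only the first occurrence of each business id
deduped = toy.drop_duplicates(subset='id', keep='first')
print(len(deduped))  # 2
```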

2. Convert the price format¶

In [4]:
# convert '€' to int(1),'€€' to int(2),'€€€' to int(3),'€€€€' to int(4)
data2 = data1
data2['price']=data2['price'].fillna('-')
In [5]:
for ind in data2.index:
    try:
        if data2['price'][ind] == '-':
            continue
        else:
            # the price string is a run of '€' characters, so its length gives the level
            data2.loc[ind, 'price'] = len(data2['price'][ind])
    except KeyError:
        print(ind)
In [6]:
for ind in data2.index:
    try:
        if data2['price'][ind] == '-':
            data2.loc[ind, 'price'] = np.nan
        else:
            continue
    except KeyError:
        print(ind)
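The two loops above can be collapsed into one vectorized expression: a price such as '€€' is just a run of '€' characters, so its string length is the level, and missing prices stay NaN automatically, removing the need for the '-' placeholder round-trip. A sketch on a toy column standing in for the raw price values:

```python
import pandas as pd
import numpy as np

# toy stand-in for the raw price column ('€' repeated 1-4 times, NaN when missing)
price = pd.Series(['€', '€€', np.nan, '€€€€'])

# str.len() counts the characters, so '€€' -> 2; NaN propagates through unchanged
price_num = price.str.len()
print(price_num.tolist())  # [1.0, 2.0, nan, 4.0]
```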

3. Handle missing latitude & longitude¶

In [7]:
data2[data2['latitude'].isnull()]
Out[7]:
id name category rating reviews price zipcode latitude longitude city address1 address2 address3 display_address
1267 34Sqd8zCW705lJTK4-YaJA The Back Page pubs 4.5 23 1 NaN NaN NaN Dublin 199 Phibsboro Road Phibsboro NaN ['199 Phibsboro Road', 'Phibsboro', 'Dublin', ...
1586 FEuyJ_bhika-QAU9DW78pg coppers face jack irish_pubs 4.0 1 NaN NaN NaN NaN Dublin NaN NaN NaN ['Dublin', 'Republic of Ireland']
In [8]:
# row 1267 has address1, which can be used to look up coordinates
geolocator = Nominatim(user_agent="my_request")
location = geolocator.geocode("199 Phibsboro Road,Phibsboro,Dublin,Republic of Ireland")
print('latitude = {}, longitude = {}'.format(location.latitude, location.longitude))
data2.loc[1267,'latitude']=location.latitude
data2.loc[1267,'longitude']=location.longitude
latitude = 53.3635526, longitude = -6.272288
In [9]:
# row 1586 has no address information at all and only 1 review, so I decided to delete it
data2.drop(index=1586,inplace = True)
data2
Out[9]:
id name category rating reviews price zipcode latitude longitude city address1 address2 address3 display_address
0 fR-pJ6nUn1bjPuT6lS2bsQ The Brazen Head pubs 4.0 739 2 8 53.344970 -6.276330 Dublin 20 Bridge Street Lower NaN NaN ['20 Bridge Street Lower', 'Dublin 8', 'Republ...
1 A-HzqcGJVTwHVFTVH_LlPA The Temple Bar pubs 4.0 550 2 2 53.345500 -6.264190 Dublin 47/48 Temple Bar Temple Bar NaN ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',...
2 rKvPQZcgjrQOLRU0phPoAQ Queen of Tarts desserts 4.5 511 2 2 53.344121 -6.267529 Dublin Cork Hill Dame Street NaN ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu...
3 _449xLONUU9nAUzCja2bNA The Porterhouse Temple Bar pubs 4.0 369 2 2 53.345100 -6.267550 Dublin 16-18 Parliament Street NaN NaN ['16-18 Parliament Street', 'Dublin 2', 'Repub...
4 -VIve-QeHR9-cKr7QldqtA Elephant & Castle tradamerican 4.0 345 2 2 53.345600 -6.262470 Dublin 18 Temple Bar NaN NaN ['18 Temple Bar', 'Dublin 2', 'Republic of Ire...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1729 TVjY8fNyPccD2nNnKULtlw Bradys Bar pubs 3.0 1 3 6 53.309898 -6.283949 Dublin 5-9 Terenure Place NaN NaN ['5-9 Terenure Place', 'Dublin 6', 'Republic o...
1730 nAj5_-n6auPSWYHGsgL3_g Voici Creperie and Wine Bar creperies 1.0 1 NaN 6 53.321891 -6.266197 Rathmines 1A Rathgar Road NaN NaN ['1A Rathgar Road', 'Rathmines, 6', 'Republic ...
1731 MoTW58ukvPLNhf98U5i2hA Grangers pubs 5.0 1 NaN 5 53.392990 -6.214760 Coolock NaN NaN NaN ['Coolock, 5', 'Republic of Ireland']
1732 HkeaGMvNDqtiZ76ttjd8Kw Martins Lounge pubs 2.0 1 NaN 11 53.390630 -6.288100 Dublin 122 Ballygall Road W NaN NaN ['122 Ballygall Road W', 'Dublin 11', 'Republi...
1733 mKSywJhByCWlDIG00K6byw Comhaltas Ceoltóirí Éireann recording_studios 5.0 1 NaN NaN 53.289089 -6.207780 Stillorgan Cultúrlann na hÉireann, 32 Belgrave Square Monkstown NaN ['Cultúrlann na hÉireann, 32 Belgrave Square',...

1538 rows × 14 columns

4. Handle missing zipcodes¶

In [10]:
data2[data2['zipcode'].isnull()]
Out[10]:
id name category rating reviews price zipcode latitude longitude city address1 address2 address3 display_address
37 jGWPezN-TLd8oQ9etopUqA Cafe Azteca mexican 4.5 144 2 NaN 53.349805 -6.260310 Dublin NaN NaN NaN ['Dublin', 'Republic of Ireland']
150 FSM9ID_UV9j0l4VL1YLkQw Goose On The Loose cafes 4.5 65 1 NaN 53.337524 -6.266099 Dublin 2 Kevin Street NaN NaN ['2 Kevin Street', 'Dublin', 'Republic of Irel...
234 t7uffDe-mgo2b4_4TR4OdQ Bow Lane cocktailbars 4.0 48 2 NaN 53.340270 -6.265540 Dublin 17 Aungier Street NaN NaN ['17 Aungier Street', 'Dublin', 'Republic of I...
280 sD87aoH4VezCI9ndUlx4pw PÓG salad 4.0 41 1 NaN 53.347415 -6.259912 Dublin 32 Bachelors Walk Dublin 1 NaN ['32 Bachelors Walk', 'Dublin 1', 'Dublin', 'R...
301 ZU3U1dx6-j9q6xvgpK21OQ Mykonos Taverna greek 4.0 38 2 NaN 53.344299 -6.266500 Dublin 76 Dame Street NaN NaN ['76 Dame Street', 'Dublin', 'Republic of Irel...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1712 sgClthFJpp4XD7nGg1ZUUg D'Arcy McGees pubs 4.0 4 2 NaN 53.299407 -6.309041 Templeogue Spawell Leisure Complex NaN NaN ['Spawell Leisure Complex', 'Templeogue, Co. D...
1717 y4nHvXDaChGCyGcNU_k97w Ciamei Cafe cafes 5.0 3 1 NaN 53.301400 -6.177730 Blackrock The Blackrock Market 19A Main Street NaN ['The Blackrock Market', '19A Main Street', 'B...
1721 AI623UyrkbjZSZS7RfNwJg Brasserie @ Dublin Airport irish 2.0 2 NaN NaN 53.396350 -6.128800 Dublin The Street Dublin Airport NaN ['The Street', 'Dublin Airport', 'Dublin', 'Re...
1726 kPxI4GAh769V1Xzlnx7ZDQ Brass Bar & Grill irish 4.0 1 NaN NaN 53.288170 -6.196170 Stillorgan Stillorgan Road NaN NaN ['Stillorgan Road', 'Stillorgan, Co. Dublin', ...
1733 mKSywJhByCWlDIG00K6byw Comhaltas Ceoltóirí Éireann recording_studios 5.0 1 NaN NaN 53.289089 -6.207780 Stillorgan Cultúrlann na hÉireann, 32 Belgrave Square Monkstown NaN ['Cultúrlann na hÉireann, 32 Belgrave Square',...

121 rows × 14 columns

In [11]:
# Use the coordinates to look up the zipcode (reverse geocoding)

import geopy

def get_zipcode(df, geolocator, lat_field, lon_field):
    location = geolocator.reverse((df[lat_field], df[lon_field]))
    try:
        return location.raw['address']['postcode']
    except KeyError:
        return 'postcode not found'

geolocator = geopy.Nominatim(user_agent='user_agents')
df = data2.loc[:,['latitude','longitude']]
#df.info()

zipcodes = df.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='latitude', lon_field='longitude')
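One caveat about the cell above: Nominatim's usage policy allows roughly one request per second, and the apply issues about 1,500 reverse lookups in a tight loop. geopy ships geopy.extra.rate_limiter.RateLimiter for exactly this; the same idea can be sketched without a network dependency (the lambda below is a hypothetical stand-in for geolocator.reverse, and the delay is shortened for illustration):

```python
import time

def throttle(func, min_delay=0.1):
    """Wrap func so successive calls are at least min_delay seconds apart
    (the idea behind geopy.extra.rate_limiter.RateLimiter)."""
    last_call = [0.0]
    def wrapped(*args, **kwargs):
        wait = min_delay - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)
    return wrapped

# hypothetical stand-in for geolocator.reverse (the real call needs network access)
lookup = throttle(lambda coords: 'result for {}'.format(coords))
start = time.monotonic()
results = [lookup((53.34, -6.27)) for _ in range(3)]
elapsed = time.monotonic() - start
print(len(results))  # 3 results, spaced at least 0.1 s apart
```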

5. Get the district names in Dublin from the postcodes¶

In [12]:
data2['postcode'] = zipcodes
data2.loc[data2['postcode']=='postcode not found']
data2['area'] = data2['postcode'].str.slice(0, 3)
data2['area']
Out[12]:
0       D08
1       D01
2       D01
3       D01
4       D01
       ... 
1729    D06
1730    D06
1731    D05
1732    D11
1733    A94
Name: area, Length: 1538, dtype: object
In [13]:
# there is one row with no zipcode, no price, and only 1 review, so I decided to drop it
data3 = data2.drop(index=1536)

6. Normalise the category names¶

In [14]:
# map raw Yelp categories onto a smaller set of labels; the patterns are applied
# in order, so e.g. 'creperies' is relabelled before the generic 'dessert' check
category_map = [
    ('bar', 'bar'), ('bistros', 'bar'), ('pub', 'pub'),
    ('creperies', 'dessert'), ('dessert', 'dessert'), ('dimsum', 'dessert'), ('donuts', 'dessert'),
    ('bagels', 'bakery'), ('bakeries', 'bakery'), ('cakeshop', 'bakery'),
    ('ramen', 'japanese'), ('sushi', 'japanese'),
    ('szechuan', 'chinese'), ('delicatessen', 'delis'), ('tapasmallplates', 'tapas'),
    ('vegetarian', 'vegan'),
    ('coffee', 'cafe'), ('cafe', 'cafe'),
]
for pattern, label in category_map:
    data3.loc[data3['category'].str.contains(pattern), 'category'] = label

Data Saving¶

The geopy lookups take a long time, so I save the file here to avoid recomputing them.

In [15]:
data3.to_csv('preprocessed.csv', sep=',', header=True, index=True,float_format = str)

Data Analysis¶

In [17]:
import pandas as pd
import numpy as np
datapath = "preprocessed.csv"
data = pd.read_csv(datapath, index_col=0)
data.head()
Out[17]:
id name category rating reviews price zipcode latitude longitude city address1 address2 address3 display_address postcode area
0 fR-pJ6nUn1bjPuT6lS2bsQ The Brazen Head pub 4.0 739 2.0 8 53.344970 -6.276330 Dublin 20 Bridge Street Lower NaN NaN ['20 Bridge Street Lower', 'Dublin 8', 'Republ... D08 WC64 D08
1 A-HzqcGJVTwHVFTVH_LlPA The Temple Bar pub 4.0 550 2.0 2 53.345500 -6.264190 Dublin 47/48 Temple Bar Temple Bar NaN ['47/48 Temple Bar', 'Temple Bar', 'Dublin 2',... D01 E8P4 D01
2 rKvPQZcgjrQOLRU0phPoAQ Queen of Tarts dessert 4.5 511 2.0 2 53.344121 -6.267529 Dublin Cork Hill Dame Street NaN ['Cork Hill', 'Dame Street', 'Dublin 2', 'Repu... D01 E8P4 D01
3 _449xLONUU9nAUzCja2bNA The Porterhouse Temple Bar pub 4.0 369 2.0 2 53.345100 -6.267550 Dublin 16-18 Parliament Street NaN NaN ['16-18 Parliament Street', 'Dublin 2', 'Repub... D01 E8P4 D01
4 -VIve-QeHR9-cKr7QldqtA Elephant & Castle tradamerican 4.0 345 2.0 2 53.345600 -6.262470 Dublin 18 Temple Bar NaN NaN ['18 Temple Bar', 'Dublin 2', 'Republic of Ire... D01 E8P4 D01

1. Overall data summarisation¶

1) There are 1537 restaurants & bars in Dublin in the data collected from the Yelp API;

2) Their average price level is 1.92, i.e. close to '€€' (the highest level is '€€€€');

3) The average user rating is 3.79 (out of a full score of 5);

4) Of the 1537 businesses, 595 are bars and 942 are restaurants, though most bars also serve food;

5) Bars and restaurants are similar in user rating and price.

In [18]:
overall = pd.DataFrame()
overall.loc[0,'City']='Dublin'
overall['Num of restaurants&bars'] = len(data)
overall['Avg price of res&bar'] = round(data.price.mean(),2)
overall['Avg review cnt of res&bar'] = round(data.reviews.mean(),2)
overall['Avg score of res&bar'] = round(data.rating.mean(),2)

overall['Num of restaurants'] = len(data.loc[(data['category']!='bar') &(data['category']!='pub')])
overall['Avg price of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].price.mean(),2)
overall['Avg review cnt of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].reviews.mean(),2)
overall['Avg score of res'] = round(data.loc[(data['category']!='bar') &(data['category']!='pub')].rating.mean(),2)

overall['Num of bars'] = len(data.loc[(data['category']=='bar') | (data['category']=='pub')])
overall['Avg price of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].price.mean(),2)
overall['Avg review cnt of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].reviews.mean(),2)
overall['Avg score of bars'] = round(data.loc[(data['category']=='bar') | (data['category']=='pub')].rating.mean(),2)
overall
Out[18]:
City Num of restaurants&bars Avg price of res&bar Avg review cnt of res&bar Avg score of res&bar Num of restaurants Avg price of res Avg review cnt of res Avg score of res Num of bars Avg price of bars Avg review cnt of bars Avg score of bars
0 Dublin 1537 1.92 30.06 3.79 942 1.91 32.76 3.8 595 1.95 25.77 3.78
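The thirteen hand-built columns above repeat the same (category != 'bar') & (category != 'pub') mask; the equivalent figures can come from a single groupby once a bar/restaurant flag is derived. A sketch on toy rows (the column names match the notebook's frame, but the values are illustrative, not the real Dublin figures):

```python
import pandas as pd

# toy rows with the notebook's column names; values are illustrative only
toy = pd.DataFrame({
    'category': ['pub', 'bar', 'cafe', 'italian'],
    'price':    [2, 3, 1, 2],
    'reviews':  [100, 50, 80, 30],
    'rating':   [4.0, 3.5, 4.5, 4.0],
})

# one flag instead of repeating the bar/pub mask in every line
toy['kind'] = toy['category'].isin(['bar', 'pub']).map({True: 'bar', False: 'restaurant'})

summary = toy.groupby('kind').agg(
    amount=('category', 'size'),
    avg_price=('price', 'mean'),
    avg_reviews=('reviews', 'mean'),
    avg_rating=('rating', 'mean'),
).round(2)
print(summary)
```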

2. District data summarisation¶

In [19]:
areas = ['D01','D02','D03','D04','D05','D06','D07','D08','D09','D10','D11','D12','D13','D14','D15','D16',
         'D18','D20','D24','A94']
description = []

for area in areas:
    row = {"Areas": area}
    row['amount'] = len(data.loc[(data['area']==area)])
    row['avg_rating'] = round(data.loc[(data['area']==area),'rating'].mean(),2)
    row['avg_reviews'] = round(data.loc[(data['area']==area),'reviews'].mean(),2)
    row['avg_price'] = round(data.loc[(data['area']==area),'price'].mean(),2)
    
    row['amount_res'] = len(data.loc[(data['category']!='bar') &(data['category']!='pub')& (data['area']==area)])
    row['avg_rating_res'] = round(data.loc[(data['category']!='bar')  &(data['category']!='pub')& (data['area']==area),'rating'].mean(),2)
    row['avg_reviews_res'] = round(data.loc[(data['category']!='bar')  &(data['category']!='pub')& (data['area']==area),'reviews'].mean(),2)
    row['avg_price_res'] = round(data.loc[(data['category']!='bar')  &(data['category']!='pub')& (data['area']==area),'price'].mean(),2)
    
    row['amount_bar'] = len(data.loc[((data['category']=='bar') | (data['category']=='pub'))& (data['area']==area)])
    row['avg_rating_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'rating'].mean(),2)
    row['avg_reviews_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'reviews'].mean(),2)
    row['avg_price_bar'] = round(data.loc[((data['category']=='bar') | (data['category']=='pub')) & (data['area']==area),'price'].mean(),2)
    description.append(row)
    
description = pd.DataFrame(description).set_index("Areas")
description
Out[19]:
amount avg_rating avg_reviews avg_price amount_res avg_rating_res avg_reviews_res avg_price_res amount_bar avg_rating_bar avg_reviews_bar avg_price_bar
Areas
D01 387 3.77 33.74 1.83 266 3.74 34.25 1.80 121 3.84 32.62 1.94
D02 531 3.83 41.79 1.98 353 3.84 42.31 1.96 178 3.81 40.75 2.03
D03 36 3.61 10.36 1.77 18 3.50 12.67 1.88 18 3.72 8.06 1.64
D04 127 3.80 19.31 2.14 80 3.74 21.15 2.19 47 3.89 16.19 2.05
D05 13 3.31 4.38 1.62 4 3.38 6.00 1.75 9 3.28 3.67 1.50
D06 110 3.80 17.64 2.02 72 3.85 20.71 2.02 38 3.70 11.82 2.03
D07 92 3.79 20.48 1.68 50 3.79 23.76 1.58 42 3.79 16.57 1.83
D08 104 4.03 29.65 1.73 57 4.00 28.88 1.64 47 4.06 30.60 1.87
D09 37 3.81 11.70 2.00 18 3.83 12.78 2.25 19 3.79 10.68 1.73
D10 3 3.00 1.33 NaN 0 NaN NaN NaN 3 3.00 1.33 NaN
D11 8 3.44 2.00 2.00 0 NaN NaN NaN 8 3.44 2.00 2.00
D12 17 3.62 4.29 1.92 3 3.83 10.00 1.67 14 3.57 3.07 2.00
D13 5 2.70 23.40 1.50 2 2.75 57.00 2.00 3 2.67 1.00 1.00
D14 16 3.44 6.69 2.07 3 3.33 5.00 2.00 13 3.46 7.08 2.09
D15 3 3.33 4.33 2.00 0 NaN NaN NaN 3 3.33 4.33 2.00
D16 9 3.44 6.44 2.29 5 3.10 6.60 2.50 4 3.88 6.25 2.00
D18 2 4.00 8.00 2.00 1 5.00 15.00 2.00 1 3.00 1.00 NaN
D20 4 3.38 9.75 1.75 1 4.00 30.00 3.00 3 3.17 3.00 1.33
D24 5 2.50 3.40 1.50 1 1.00 1.00 NaN 4 2.88 4.00 1.50
A94 27 3.94 9.26 1.84 8 4.12 9.88 1.80 19 3.87 9.00 1.86
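The per-district loop above is the same pattern once more; grouping on area together with a bar/restaurant flag and unstacking produces the equivalent wide table in one expression. A sketch on toy rows (illustrative values only):

```python
import pandas as pd

# illustrative rows; the real frame has one row per business
toy = pd.DataFrame({
    'area':     ['D01', 'D01', 'D02', 'D02'],
    'category': ['pub', 'cafe', 'bar', 'italian'],
    'rating':   [4.0, 3.5, 4.5, 4.0],
})
# flag bars/pubs vs everything else
toy['kind'] = toy['category'].isin(['bar', 'pub']).map({True: 'bar', False: 'res'})

# rows: area; columns: (statistic, kind) pairs
table = (toy.groupby(['area', 'kind'])['rating']
            .agg(['size', 'mean'])
            .unstack('kind'))
print(table)
```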

1) Geographical distribution of restaurants and bars¶

· In the map below, we can see that the geographical distribution spreads from the city centre outwards to the suburbs along the city's main roads;

In [20]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go   # plotly's graph_objects module for map figures
# draw the businesses as a scatter layer on the map
scatter = go.Scattermapbox(lat=data['latitude'],
                           lon=data['longitude'],
                           hovertext=data['name'],
                           hoverinfo='text'
                          )
fig = go.Figure(scatter)                           # add the scatter trace to a figure
fig.update_layout(mapbox_style='open-street-map')  # use the free OpenStreetMap tiles
# other free map styles: "carto-positron", "carto-darkmatter", "stamen-terrain", "stamen-toner", "stamen-watercolor"
fig.show()

2) Number/avg_price/user_rating of restaurants in popular districts¶

· Restaurants and bars are mainly concentrated in Districts 1 and 2 in the city centre;

· There are also some in Dublin 4, 6, 7 and 8, but few in other districts;

· The average price is highest in D04 and lowest in D07 & D08;

· The user rating is higher in D08 than in the other districts, so D08 may have some restaurants that are both cheap and tasty.

In [22]:
popular_district = ['D01','D02','D04','D06','D07','D08']

m1 = description.loc[['D01','D02','D04','D06','D07','D08'],['amount','avg_rating','avg_price']]
m1[['avg_rating','avg_price']].plot(kind='bar')
m1['amount'].plot(colormap='Purples_r',kind='line',secondary_y=True)

ax = plt.gca()
ax.set_xticklabels(popular_district)

plt.show()
In [26]:
plt.figure(figsize=(10,5))
grouped = description.amount.sort_values(ascending=False)[:10]
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("GnBu_r", len(grouped)))
plt.xlabel('Area', labelpad=10, fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Count of Restaurants by Area (Top 10)', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for  i, v in enumerate(grouped):
    plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)

In [27]:
plt.figure(figsize=(10,5))
grouped = description.loc[description['amount']>10,'avg_price'].sort_values(ascending=False)
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("GnBu_r", len(grouped)))
plt.xlabel('Area', labelpad=10, fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price of Restaurants & Bars in different areas', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for  i, v in enumerate(grouped):
    plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)

3. Distribution of restaurant prices¶

The price level of most restaurants in Dublin is '€' or '€€', i.e. low to medium.

In [28]:
plt.figure(figsize=(10,5))
grouped = data.price.value_counts().sort_index()
sns.barplot(x=grouped.index, y=grouped.values, palette=sns.color_palette("RdBu_r", len(grouped)))
plt.xlabel('price', labelpad=10, fontsize=14)
plt.ylabel('Count of restaurants', fontsize=14)
plt.title('Count of Restaurants against prices', fontsize=15)
plt.tick_params(labelsize=14)
for  i, v in enumerate(grouped):
    plt.text(i, v*1.02, str(v), horizontalalignment ='center',fontweight='bold', fontsize=14)

4. Count of Restaurants & Bars by Category¶

· Apart from bars & pubs, the most numerous types of restaurants in Dublin are: cafe, irish, italian, pizza, indpak...

· In addition to Western restaurants, there are also many East Asian restaurants, such as chinese, japanese and thai. This suggests that Dublin is a city with a fairly diverse food culture.

In [29]:
plt.figure(figsize=(10,5))
grouped = data.category.value_counts().sort_values(ascending=False)[:12]
sns.countplot(y='category',data=data, 
              order = grouped.index, palette= sns.color_palette("RdBu_r", len(grouped)))
plt.xlabel('Count', labelpad=21, fontsize=14)
plt.ylabel('Category', fontsize=14)
plt.title('Count of Restaurants & Bars by Category (Top 12)', fontsize=15)
# the bars are horizontal, so the count v is the x position and the index i the y position
for i, v in enumerate(grouped):
    plt.text(v*1.01, i, str(v), fontweight='bold', fontsize=14)

5. Top restaurant & bar selection¶

First, we look at the businesses with the highest review counts, which indicates that many people have been there.

It is easy to see that although some restaurants have many reviews, some of those reviews may be negative.

Restaurants with a low user rating are not what I am looking for.

In [30]:
top_reviewed = data[['name','reviews','rating']].sort_values(by='reviews', ascending=False)[:10]
top_reviewed
Out[30]:
name reviews rating
0 The Brazen Head 739 4.0
1 The Temple Bar 550 4.0
2 Queen of Tarts 511 4.5
3 The Porterhouse Temple Bar 369 4.0
952 The Bank on College Green 369 4.5
4 Elephant & Castle 345 4.0
5 Cornucopia 334 4.5
6 Brother Hubbard 329 4.5
954 The Hairy Lemon 324 4.0
7 O'Neills Bar & Restaurant 281 3.5
In [31]:
plt.figure(figsize=(11,6))
grouped = data[['name','reviews']].sort_values(by='reviews', ascending=False)[:10]
sns.barplot(x=grouped.reviews, y = grouped.name, palette=sns.color_palette("GnBu_r", len(grouped)), ci=None)
plt.xlabel('Count of Review', labelpad=10, fontsize=14)
plt.ylabel('Restaurants', fontsize=14)
plt.title('Top 10 Restaurants with Most Reviews', fontsize=15)
plt.tick_params(labelsize=14)
plt.xticks(rotation=15)
for  i, v in enumerate(grouped.reviews):
    plt.text(v, i, str(v), fontweight='bold', fontsize=14)

So when choosing a restaurant, we need to consider both the review count and the user rating.

1) Select businesses that have at least 100 reviews and a user rating >= 4.5, rank them by review count, and add the top 10 to my wishlist.

2) More reviews means more people have tried the restaurant, which makes the user rating more credible.

3) The 'top_bussiness' table below is what I plan to try in the future.

In [32]:
top_bussiness = data.loc[data['reviews']>=100, ['id','name','category','price','rating','reviews']].sort_values(by=['rating','reviews'], ascending=False)[:10]
top_bussiness_id = top_bussiness.id.tolist()
top_bussiness
Out[32]:
id name category price rating reviews
70 jO5EkqNn6IiypNfAAr_WfA Green Bench Café salad 1.0 5.0 100
2 rKvPQZcgjrQOLRU0phPoAQ Queen of Tarts dessert 2.0 4.5 511
952 dlClCiMV4Y8yTc9vCUCABw The Bank on College Green pub 2.0 4.5 369
5 ZdZNRZ1OdQ1MYfaK0vsbNw Cornucopia vegan 2.0 4.5 334
6 DM0Tcka4QpP4YqCfJ5nL1g Brother Hubbard mideastern 2.0 4.5 329
10 LG37RcSre8vSlS-5uJE2DA The Bakehouse bakery 1.0 4.5 261
956 bwhASCB14C2mlmctXcsKtA The Stag's Head pub 2.0 4.5 259
13 iNk7KmI1j-tfPGSNs6RXvg The Pig's Ear bar 3.0 4.5 223
18 cOpu16xeZUJNnhhUs71MJA L Mulligan Grocer pub 2.0 4.5 199
957 gVVBwMK1bd53VvT51XtPVQ Vintage Cocktail Club V.C.C bar 3.0 4.5 190

6. Get features from the top businesses¶

1) Collect review data¶

In [33]:
import json, requests
key = 'akX_q_pdbj6rh2xLXX35DTVNvqdD3T3F9gopk7Zqtp98hZF0gFEzGAdkZZEDwUU9o0iLsq0tOJN99o9eknmi8SvB1hz2FmFsQBzT9Oq9lVyLxyvACu72aem38OBWY3Yx'
headers = {'Authorization': 'bearer %s' % key}

reviewdata = {'review_id':[],'user_id':[],'business_id':[],'text':[],'datetime':[]}

for bussiness_id in top_bussiness_id: 
    url = 'https://api.yelp.com/v3/businesses/'+bussiness_id+'/reviews'
    response = requests.get(url,headers=headers)
    #print(response.json())    
    query = response.json()['reviews']

    for q in query:
        reviewdata['review_id'].append(q['id'])
        reviewdata['text'].append(q['text'])
        reviewdata['user_id'].append(q['user']['id'])
        reviewdata['business_id'].append(bussiness_id)
        reviewdata['datetime'].append(q['time_created'])
          
reviewdata = pd.DataFrame(reviewdata)
In [34]:
reviewdata.to_csv('yelp_review.csv', sep=',', header=True, index=True,float_format = str)

2) Load review data¶

(yelp_reviews.csv is the full review set scraped from the Yelp website with BeautifulSoup, as described in the conclusion, not the 3-reviews-per-business sample saved above.)

In [36]:
reviewpath = "yelp_reviews.csv"
reviews = pd.read_csv(reviewpath, index_col=0)
reviews.head()
Out[36]:
user_id user_name datetime business_id business_name text
num
0 fTqlWcqiFIVNrfXbF6C2mw Peter W. 30/4/2022 jO5EkqNn6IiypNfAAr_WfA Green Bench Café Charming location, charming service, delicious...
1 d_TBs6J3twMy9GChqUEXkg Jennifer O. 26/6/2021 jO5EkqNn6IiypNfAAr_WfA Green Bench Café I ate here precovid, I tried ir br ket s wich...
2 8DwIFAcbzhwnZdd7LC_JuQ Brad D. 24/6/2022 jO5EkqNn6IiypNfAAr_WfA Green Bench Café Just Perfect Food w h a smile.  What else coul...
3 lWcyfDKDlHSk3yclJ-tkiw Mark G. 9/8/2019 jO5EkqNn6IiypNfAAr_WfA Green Bench Café Yummy healthy food.  Great soups.  Vegan, Ve a...
4 LfgC6aypR9dnH6oSceMP9g JI X. 17/9/2019 jO5EkqNn6IiypNfAAr_WfA Green Bench Café Th st s wich I've ever had. Although 's...

3) Preprocess text data¶

In [37]:
import re
import pandas as pd
import nltk
import collections
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

corpus = []

for ids in top_bussiness_id:
    corpus.append(reviews['text'][reviews['business_id']==ids].tolist()) 
In [38]:
# concatenate each restaurant's reviews into a single document string
docs = [''.join(texts) for texts in corpus]
In [39]:
stop_words = set(stopwords.words('english'))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
# a few extra stopwords observed in the review text
extra_stopwords = {'the','it','is','you','and','have','get','this','also','are','to','be','was','little'}

docs4 = []
for doc in docs:
    # lowercase and tokenize; the \w+ tokenizer already strips punctuation symbols
    tokens = tokenizer.tokenize(doc.lower())
    # filter with a comprehension rather than list.remove(), which silently
    # skips elements when items are removed from a list while iterating over it
    tokens = [w for w in tokens if not w.isnumeric()
              and w not in stop_words and w not in extra_stopwords]
    docs4.append(tokens)
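PorterStemmer is imported in the cell above but never applied; stemming would fold inflected forms ('desserts'/'dessert', 'baked'/'baking') into one token before counting. A minimal sketch of that extra step on standalone tokens:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['desserts', 'baking', 'baked', 'cakes']
# each inflected form collapses onto a common stem
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```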
In [40]:
# join each restaurant's filtered tokens back into one comma-separated string
corpus = [','.join(doc) for doc in docs4]

4) Get the feature words of each restaurant using a word cloud¶

1) Take 'The Bank on College Green' as the first example:

From the word cloud we can easily see that the feature words for 'The Bank on College Green' are 'drink', 'dinner' and 'bar', together with 'great service' & 'atmosphere'.

In [41]:
top_bussiness.loc[top_bussiness['name']=='The Bank on College Green']
Out[41]:
id name category price rating reviews
952 dlClCiMV4Y8yTc9vCUCABw The Bank on College Green pub 2.0 4.5 369
In [42]:
#!pip install wordcloud
from wordcloud import WordCloud
from wordcloud import ImageColorGenerator
wordcloud = WordCloud( background_color="white").generate(corpus[2])
plt.figure( figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

2) Take 'Queen of Tarts' as a second example:

Here the feature words are 'breakfast', 'cake' and 'coffee', with a 'sweet' & 'delicious' taste, completely different from 'The Bank on College Green'.

In [43]:
top_bussiness.loc[top_bussiness['name']=='Queen of Tarts']
Out[43]:
id name category price rating reviews
2 rKvPQZcgjrQOLRU0phPoAQ Queen of Tarts dessert 2.0 4.5 511
In [44]:
wordcloud = WordCloud( background_color="white").generate(corpus[1])
plt.figure( figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
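A word cloud sizes words by frequency, so the same "feature words" can be read off numerically with collections.Counter; in the notebook the equivalent input would be docs4[2] for 'The Bank on College Green'. A sketch on a toy token list (the tokens are illustrative, not the real review data):

```python
from collections import Counter

# illustrative tokens standing in for one restaurant's filtered review words
tokens = ['drink', 'bar', 'drink', 'dinner', 'atmosphere', 'drink', 'bar']
top = Counter(tokens).most_common(3)
print(top)  # [('drink', 3), ('bar', 2), ('dinner', 1)]
```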

Conclusion¶

In this task,

  1. I gained a preliminary understanding of the geographical distribution, cuisine types, prices, etc. of Dublin's restaurants and bars;

  2. I screened out 10 popular, highly rated restaurants and bars and added them to my food wishlist, planning to try them with friends later;

  3. By analysing the text of the user reviews, I got a general sense of each restaurant's character: some serve delicious desserts, some are great bars, and some offer a good dinner atmosphere. I will gradually try these 10 restaurants according to my own needs.

Two challenges appeared in the process of using the Yelp API:

  1. Yelp Business Search returns at most 1,000 results per query, which means I had to repeat and split the query several times to obtain all the restaurants in the city;

  2. The Yelp Reviews endpoint returns a fixed 3 reviews per business, and three reviews are not enough for natural language analysis. After searching on Google, I used BeautifulSoup to scrape every review from the Yelp website instead. This was painful: first, a new scraping task consumes a lot of time, and if I were starting again I would choose a project whose data can be obtained directly from an API; secondly, the scraped data is not as complete as the API's. These are two very different data acquisition methods, and mixing them may not be suitable for very rigorous data analysis: for example, one business records 300 reviews in Yelp Business, but the scraper captured only 286 of them.

I have since found that Yelp offers complete offline datasets for download and analysis, which would solve the problem that complete data cannot be obtained from the API. However, downloading an offline dataset does not meet the requirements of this assignment.
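On the first challenge, the cap works like this: each Business Search request returns at most 50 businesses (the limit parameter), pages are selected with offset, and offset + limit may not exceed 1,000 per query, so fuller coverage means splitting the query (by category, price, or neighbourhood) and paginating each slice. A sketch of the offset loop, assuming the same headers as earlier; the get parameter is injectable so the function can be exercised without network access or a valid API key:

```python
import requests

def search_all(term, location, headers, page_size=50, cap=1000, get=requests.get):
    """Page through Yelp's business search endpoint until the 1,000-result
    cap or the end of the result set, whichever comes first."""
    url = 'https://api.yelp.com/v3/businesses/search'
    businesses = []
    for offset in range(0, cap, page_size):
        params = {'term': term, 'location': location,
                  'limit': page_size, 'offset': offset}
        batch = get(url, headers=headers, params=params).json().get('businesses', [])
        businesses.extend(batch)
        if len(batch) < page_size:   # last page reached before the cap
            break
    return businesses
```

For the full city, the same loop would be run once per category or neighbourhood slice so that each slice stays under the cap.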

If I have the chance to revisit this later,

I could do more in-depth text analysis on the complete, much larger offline dataset, such as finding the keywords in people's positive and negative reviews of a restaurant, so as to learn which strengths the restaurant should maintain and which weaknesses it should improve.

I have had several favourite restaurants (with delicious food) go out of business due to poor management (low publicity, poor service, poor location, etc.). This makes me very sad, because it is so hard to find replacements. If I come across a restaurant I like very much in Dublin, I may approach it from this angle and give the restaurant some management advice, hoping they can keep up the good food and operate for a long time.